Linguistic annotation in / for corpus linguistics
Abstract
This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of 'corpus' as a radial category and then, in Section 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the diversity of formats currently available and the emergence of XML annotation as, for now, the most widespread form of annotation. Section 4 summarizes and concludes with desiderata for future developments.
Related resources
Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora
There has been an increasing interest in recent years in the enrichment of natural language corpora in terms of annotation with explicit linguistic information. This interest manifests itself most prominently in two areas of linguistics: corpus linguistics and computational linguistics. For corpus linguistics, the long-standing practice has been to work on raw, i.e., unannotated text. While raw...
Linguistic Annotation: from Links to Cross-Layer Lexicons
Lexicons have always been part of linguistics, the more in the era of computational linguistics. Complex, deep linguistic annotation has emerged as an important research phenomenon relatively recently. Even though various annotation schemes ([10], [13], [15], [16], [17]) have been developed containing some sort of explicit or implicit reference to a “lexicon”, none has presented a coherent and ...
Detecting Annotation Errors in Spoken Language Corpora
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...
Annotating Discourse Anaphora
In this paper, we present preliminary work on corpus-based anaphora resolution of discourse deixis in German. Our annotation guidelines provide linguistic tests for locating the antecedent, and for determining the semantic types of both the antecedent and the anaphor. The corpus consists of selected speaker turns from the Europarl corpus.
Linguistic Resources and Software for Shallow Processing
This paper presents linguistic resources and software consisting of a hand-tagged corpus of 1 million tokens and several shallow-processing annotation tools.